In order to study the ways in which our country’s weather has behaved in the past years, we decided to look for a data set on different sources from the Internet. Unfortunately, we have not found any data set containing the metrics we need, therefore we decided to use the Visual Crossing Weather API, which would give us the weather conditions in Romania in a set of locations and time periods we chose to specify.
The period of time has been set to a daily basis from Jan 1st 2011, until Dec 31st 2021. This choice has been made due to financial limitations we encountered on the API’s part. In case that we had wanted to procure more data, more requests would have been necessary to be made to the public API, exceeding the threshold of free requests.
For each request, the API will return a single .csv file, respecting the name convention weather_YEAR_COUNTY.csv. So, for 41 Romanian counties and 11 years, we would obtain 451 csv files. However, we need to work with a single data set, so we decided to merge all of them into a single one in the code block below.
library(tidyverse)## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.6 v dplyr 1.0.8
## v tidyr 1.2.0 v stringr 1.4.0
## v readr 2.1.2 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(ggplot2)
library(plotly)##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
# TRUE only at first when "weather_2011-2021_Romania.csv" is not present in our project
MERGE_ALL_DATASETS = FALSE
if (MERGE_ALL_DATASETS) {
filenames_list = list.files(
path = "../weather_data/",
pattern = "*.csv",
full.names = TRUE
)
data = lapply(filenames_list, read_csv) %>% bind_rows()
write.csv(data, "weather_2011-2021_Romania.csv", row.names = FALSE, na = "")
}
data = read_csv("weather_2011-2021_Romania.csv", show_col_types = FALSE)
rm(MERGE_ALL_DATASETS)TODO: Describe the meaning of each relevant column using this documentation
The data set we obtained is mostly in a good shape. However, on a quick scan, we identified a few aspects that required some attention before diving into the exploratory analysis (e.g. splitting complex columns into multiple simpler ones).
In order to add value to the data set, we also decided to do some feature engineering, and create additional columns that would enhance the diversity of the data types. This diversity will allow for creating new types of plots and open multiple questions with possibly more surprising insights in the Romanian weather.
We start by removing blank spaces from column names for easier writing of the R code (e.g. “Minimum Temperature” becomes “MinimumTemperature”) and redundant columns ResolvedAddress and Name.
names(data) = gsub(' ', '', names(data))
data = data %>% select(-c("ResolvedAddress", "Name"))The data set contains two columns named Weather Type and Conditions that contain strings of different weather conditions identified at a specific time and location, each condition being separated by a comma. However, there can be multiple conditions reported in a day, so we decided that, instead of having a single column describing all of the conditions, we could create a separate column for each weather type and condition.
As a result, all of these new columns will be of boolean type, describing whether the weather has been or not identified in the given condition. There are 37 possible weather types and 4 possible conditions in the data set, therefore 41 new columns will be created.
accumulate_weather_func = function(accumulated_types, weather) {
if (!is.na(weather)) {
conditions = strsplit(weather, ", ")[[1]]
accumulated_types = union(accumulated_types, conditions)
}
return(accumulated_types)
}
# Compute the list of possible weather types in this data set
weather_types = reduce(data$WeatherType, accumulate_weather_func, .init = c())
cat("Number of possible weather types:", length(weather_types))## Number of possible weather types: 37
# Create the weather type columns
for (weather_type in weather_types) {
column_name = gsub(' ', '', weather_type) # remove spaces
column_name = gsub('/', 'Or', column_name) # replace slashes with Or
data[[column_name]] = with(
data,
# If Weather is NA, then new column will contain NA as well
if_else(
is.na(WeatherType),
NA,
# If Weather is defined, check that condition is present => set TRUE/FALSE
if_else(
grepl(weather_type, WeatherType, fixed = TRUE),
TRUE,
FALSE
)
)
)
}
rm(column_name, weather_type)
# Display an example of some weather columns
data %>%
slice(42:46) %>%
select(c(WeatherType, Fog, LightRain, SnowShowers, Duststorm))## # A tibble: 5 x 5
## WeatherType Fog LightRain SnowShowers Duststorm
## <chr> <lgl> <lgl> <lgl> <lgl>
## 1 Mist, Fog, Light Rain TRUE TRUE FALSE FALSE
## 2 Snow Showers, Mist, Rain, Rain Showers,~ FALSE TRUE TRUE FALSE
## 3 Mist FALSE FALSE FALSE FALSE
## 4 <NA> NA NA NA NA
## 5 Mist FALSE FALSE FALSE FALSE
# Compute the list of possible conditions in this data set
conditions = reduce(data$Conditions, accumulate_weather_func, .init = c())
cat("Number of possible conditions:", length(conditions))## Number of possible conditions: 4
# Create the conditions columns
for (condition in conditions) {
column_name = paste0("Condition", condition)
if (condition == "Partially cloudy") {
column_name = "ConditionPartiallyCloudy"
}
data[[column_name]] = with(
data,
# If Condition is NA, then new column will contain NA as well
if_else(
is.na(Conditions),
NA,
# If Condition is defined, check that condition is present => set TRUE/FALSE
if_else(
grepl(condition, Conditions, fixed = TRUE),
TRUE,
FALSE
)
)
)
}
rm(column_name, condition)
# Display an example of the conditions columns
data %>%
slice(11:15) %>%
select(c(Conditions, ConditionClear, ConditionPartiallyCloudy, ConditionOvercast, ConditionRain))## # A tibble: 5 x 5
## Conditions ConditionClear ConditionPartia~ ConditionOverca~ ConditionRain
## <chr> <lgl> <lgl> <lgl> <lgl>
## 1 Clear TRUE FALSE FALSE FALSE
## 2 Partially clou~ FALSE TRUE FALSE FALSE
## 3 Overcast FALSE FALSE TRUE FALSE
## 4 Overcast FALSE FALSE TRUE FALSE
## 5 Rain, Overcast FALSE FALSE TRUE TRUE
As the data is currently offered, the wind direction is expressed as a real number, whose value ranges between 0 and 360 degrees. We can express this value under a categorical form, by assigning a named direction for each specific division on the trigonometric circle. In this way, we can obtain 16 different categories for 16 subdivisions, as can be checked in the figure below.
wind_direction_table = data.frame(
"Cardinal_Point" = c("N", "NNE", "NE", "ENE", "E", "ESE", "SE", "SSE", "S", "SSW", "SW", "WSW", "W", "WNW", "NW", "NNW"),
"From" = c(348.75, seq(11.25, 326.25, 22.5)),
"To" = seq(11.25, 348.75, 22.5)
)Table for Wind Cardinal Directions and Degrees
Using the above table, we can take each value from the WindDirection column and build the new column WindCardinalDirection.
# Break down the numeric direction into more categories
data = data %>%
mutate(
WindCardinalDirection = cut(
x = WindDirection,
breaks = c(0, wind_direction_table$To, 360),
labels = c(wind_direction_table$Cardinal_Point, "N")
)
)
# Display an example of some wind directions
data %>%
slice_sample(n=5) %>%
select(c(WindDirection, WindCardinalDirection))## # A tibble: 5 x 2
## WindDirection WindCardinalDirection
## <dbl> <fct>
## 1 290. WNW
## 2 208. SSW
## 3 140. SE
## 4 238. WSW
## 5 205. SSW
The data set presents a column called Visiblity that represents the distance, measured in kilometers, at which an object or light can be clearly discerned.
As mentioned by the National Weather Service at this location, we can break down these distances into 4 sets, creating a new column called VisibilityLevel, which describes an ordered categorical variable.
The 4 divisions are described as being the following:
data = data %>%
mutate(
VisibilityLevel = cut(
x = Visibility,
breaks = c(0, 0.926, 3.704, 9.26, +Inf),
labels = c("Very Poor", "Poor", "Moderate", "Good"),
include.lowest = TRUE
)
)
# Display an example of some visibility levels
data %>%
slice_sample(n=5) %>%
select(c(Visibility, VisibilityLevel))## # A tibble: 5 x 2
## Visibility VisibilityLevel
## <dbl> <fct>
## 1 10.2 Good
## 2 17 Good
## 3 14.6 Good
## 4 10 Good
## 5 10 Good
The column Address contains the name of each county seat in our country, followed by , Romania, but we would like to have a column called County that contains only the name of the county (not the county seat) to use while plotting the Romania’s map. It is important that the county names computed here to be the same as the ones employed in the geoJson file containing data about drawing the counties’ borders (e.g. must rename Bucharest to Bucuresti).
romania_counties = c("Alba","Arad","Arges","Bacau","Bihor","Bistrita-Nasaud","Botosani","Brasov","Braila","Bucuresti","Buzau","Caras-Severin","Calarasi","Cluj","Constanta","Covasna","Dambovita","Dolj","Galati","Giurgiu","Gorj","Harghita","Hunedoara","Ialomita","Iasi","Maramures","Mehedinti","Mures","Neamt","Olt","Prahova","Satu Mare","Salaj","Sibiu","Suceava","Teleorman","Timis","Tulcea","Vaslui","Valcea","Vrancea")
romania_county_seats = c("Alba Iulia","Arad","Pitesti","Bacau","Oradea","Bistrita","Botosani","Brasov","Braila","Bucuresti","Buzau","Resita","Calarasi","Cluj-Napoca","Constanta","Sfantu Gheorghe","Targoviste","Craiova","Galati","Giurgiu","Targu Jiu","Miercurea Ciuc","Deva","Slobozia","Iasi","Baia Mare","Drobeta-Turnu Severin","Targu Mures","Piatra Neamt","Slatina","Ploiesti","Satu Mare","Zalau","Sibiu","Suceava","Alexandria","Timisoara","Tulcea","Vaslui","Ramnicu Valcea","Focsani")
# First create a column CountySeat, translate the special Bucharest case,
# and then, based on the above lists, create the final column County
data = data %>%
mutate(CountySeat = gsub(', Romania', '', Address)) %>%
mutate(
CountySeat = ifelse(
CountySeat == "Bucharest",
"Bucuresti",
CountySeat
)
) %>%
mutate(
County = plyr::mapvalues(
x = CountySeat,
from = romania_county_seats,
to = romania_counties
)
)
# Display an example of several counties
data %>%
slice_sample(n=5) %>%
select(c(Address, CountySeat, County))## # A tibble: 5 x 3
## Address CountySeat County
## <chr> <chr> <chr>
## 1 Tulcea, Romania Tulcea Tulcea
## 2 Braila, Romania Braila Braila
## 3 Miercurea Ciuc, Romania Miercurea Ciuc Harghita
## 4 Constanta, Romania Constanta Constanta
## 5 Baia Mare, Romania Baia Mare Maramures
We were not sure what the API will return when calling it to get Ilfov’s weather (might have returned Bucharest’s weather, which is inside it), so we decided to pass this county’s API calls in order to allow for other free calls for other counties.
In this project we will suppose that Ilfov and Bucharest share the same weather.
ilfov_data = data %>%
filter(County == "Bucuresti") %>%
mutate(
Address = "Ilfov, Romania",
County = "Ilfov",
CountySeat = "Bucuresti",
Latitude = 44.5355,
Longitude = 26.2324
)
data = rbind(data, ilfov_data)
rm(ilfov_data)First we define the data set portion that we are going to use in this analysis.
# Select relevant columns for this analysis.
data_wt_cond = data %>%
select(
"Datetime", "County", "Latitude", "Longitude",
"Mist":"HeavyFreezingRain",
"ConditionPartiallyCloudy":"ConditionRain"
)
wt_cols = 5:41
cond_cols = 42:45There is a wide variety of weather types that can be identified in Romania during the years’ four seasons. We have identified 37 special ones, as labeled by the Visual Crossing Weather API, but what is the most common unusual weather in Romania? We could first take a look at the extact number of occurrences for each weather type.
# Create data frame for the first plot
data_wt_cond_1 = data_wt_cond %>%
summarise(across(all_of(wt_cols), sum, na.rm = TRUE)) %>%
gather(key = "WeatherType", value = "Occurrences") %>%
arrange(Occurrences) %>%
mutate(WeatherType = factor(WeatherType, WeatherType))
total_occurrences = (data_wt_cond_1 %>% summarise(sum(Occurrences)))[[1]]
data_wt_cond_1 = data_wt_cond_1 %>%
mutate(OccurrencesFreq = round(Occurrences / total_occurrences * 100, 3))
# Define text label function
get_occ_text = function(type_col, occ_col, occ_freq_col) {
texts = list()
for (i in 1:length(occ_col)) {
text = paste0(
"Weather: ", type_col[i], "\n",
occ_freq_col[i], "% of Unusual Weather\n",
occ_col[i], " observed occurrences"
)
texts = append(texts, text)
}
return(texts)
}
plot_wt_cond_1 = data_wt_cond_1 %>%
ggplot(aes(
x = WeatherType,
y = OccurrencesFreq,
text = get_occ_text(WeatherType, Occurrences, OccurrencesFreq)
)) +
geom_segment(
aes(x = WeatherType, xend = WeatherType, y = 0, yend = OccurrencesFreq),
color = "#82a3e0",
alpha = 1
) +
geom_point(color="#82a3e0", size=2, alpha=1) +
coord_flip() +
labs(
title = "Frequency of Unusual Weather Conditions in Romania",
x = "",
y = "% of unusual weather occurrences"
)
rm(data_wt_cond_1, total_occurrences)
ggplotly(plot_wt_cond_1, tooltip = "text")The lollipop plot from above tells us that the Mist weather has the biggest amount of occurrences during 2011-2021 when summing up all the Romania’s counties. This might be explained by the counties found in the Carpathian Mountains where this type of weather is usually more manifested, together with Rain, Fog, and Snow.
In order to confirm this hypothesis, we decided to plot the most unusual weather for each county and check the results in the following map.
library(geojsonio)## Registered S3 method overwritten by 'geojsonsf':
## method from
## print.geojson geojson
##
## Attaching package: 'geojsonio'
## The following object is masked from 'package:base':
##
## pretty
library(broom)
romania_df = geojson_read("romania.geojson", what = "sp")
romania_map = tidy(romania_df, region = "name")
data_wt_cond_2 = data_wt_cond %>%
select(c(County, all_of(wt_cols))) %>%
group_by(County) %>%
summarise(across(1:37, sum, na.rm = TRUE)) %>%
gather(key = "WeatherType", value = "Occurrences", -County) %>%
group_by(County) %>%
filter(Occurrences == max(Occurrences)) %>%
mutate(WeatherType = factor(WeatherType, WeatherType)) %>%
select(-Occurrences) %>%
ungroup()
map_wt_cond_2 = romania_map %>%
left_join(. , data_wt_cond_2, by = c("id" = "County"))
plot_wt_cond_2 = ggplot() +
geom_polygon(
data = map_wt_cond_2,
aes(fill = WeatherType, x = long, y = lat, group = group),
color="#dbdbdb"
) +
coord_map() +
theme_void() +
theme(plot.title = element_text(hjust = 0.5)) +
scale_fill_manual(
values = c("#b1b3c7", "#848bc4", "#888a99", "#343873"),
labels = c("Mist", "Sky Coverage Increasing", "Fog", "Rain")
) +
labs(
title = "Most Unusual Weather Conditions in Romania",
fill = "Weather Type"
)
rm(romania_df, data_wt_cond_2, map_wt_cond_2)
plot_wt_cond_2As it is the case here, it seems that the Mist weather is encountered mostly in the southern and eastern parts of the country, while the Oriental Carpathians present a Fog weather, with mostly Rainy days in Mures and Sibiu, inside the Transylvanian Plateau. Note that Mist is also a type of Fog, but one that contains less condensed water drops.
Here we look into the 4 conditions the data set has to offer. In the pie chart below you can see the overall distribution of the weather conditions in the entire period among all the Romania’s counties, and it looks like the most part of the weather is Partially Cloudy.
data_wt_cond_3 = data_wt_cond %>%
summarise(across(all_of(cond_cols), sum, na.rm = TRUE)) %>%
gather(key = "Condition", value = "Occurrences") %>%
arrange(desc(Occurrences)) %>%
mutate(Condition = factor(Condition, Condition)) %>%
mutate(ypos = cumsum(Occurrences) - 0.5 * Occurrences )
total_occurrences = (data_wt_cond_3 %>% summarise(sum(Occurrences)))[[1]]
data_wt_cond_3 = data_wt_cond_3 %>%
mutate(OccurrencesFreq = round(Occurrences / total_occurrences * 100, 2))
plot_wt_cond_3 = data_wt_cond_3 %>%
ggplot(aes(x = "", y = Occurrences, fill = Condition)) +
geom_bar(stat = "identity", width = 1, color="white") +
coord_polar("y", start = 0) +
theme_void() +
theme(plot.title = element_text(hjust = 0.5)) +
labs(title = "Distribution of registered Sky Conditions") +
scale_fill_manual(
values = c("#0c9fc7", "#041b8f", "#02d7e3", "#363e66"),
labels = c("Partially Cloudy", "Rain", "Clear", "Overcast")
) +
geom_text(
aes(y = total_occurrences - ypos, label = paste0(OccurrencesFreq, "%")),
color = "white",
size = 6
)
rm(data_wt_cond_3)
plot_wt_cond_3small_data = data %>% group_by(Address) %>% filter(row_number() == 1)
ggplot(small_data, aes(x=Latitude, y=Longitude)) +
geom_point(color="orange",
fill="#69b3a2",
shape=21,
alpha=0.5,
size=6,
stroke = 2)+geom_text(
label=rownames(small_data),
nudge_x = 0.25, nudge_y = 0.25,
check_overlap = T
)